Towards Storytelling from
Visual Lifelogging: An Overview 

Marc Bolanos*, Mariella Dimiccoli*, and Petia Radeva 


Abstract —Visual lifelogging consists of acquiring images that 
capture the daily experiences of the user by wearing a camera 
over a long period of time. The pictures taken offer considerable 
potential for knowledge mining concerning how people live their 
lives; hence, they open up new opportunities for many potential
applications in fields including healthcare, security, leisure and 
the quantified self. However, automatically building a story from 
a huge collection of unstructured egocentric data presents major 
challenges. This paper provides a thorough review of advances 
made so far in egocentric data analysis, and in view of the current 
state of the art, indicates new lines of research to move us towards 
storytelling from visual lifelogging. 

Index Terms —visual lifelogging, egocentric vision, storytelling 

I. Introduction

LIFELOGGING consists of a user continuously recording
their everyday experiences, typically via wearable sensors
including accelerometers and cameras, among others. When
the visual signal is the only one recorded, typically by a
wearable camera, it is referred to as visual lifelogging. This
trend is growing rapidly thanks to advances in
wearable technologies over recent years. Nowadays, wearable
cameras are very small devices that can be worn all day long
and automatically record the everyday activities of the wearer
in a passive fashion, from a first-person point of view. As
an example, Fig. 1 shows pictures taken by a person walking
down a street while wearing such a camera.

Most wearable cameras on the market, such as GoPro, MeCam,
Looxcie or Google Glass (see Fig. 2(a) and (c)), are video
cameras, which have relatively High Temporal Resolution (HTR)
(e.g. from 25 up to 60 frames per second) and are more suitable
for recording specific moments, such as cooking or doing sports.
A limited number of wearable cameras, such as Narrative
Clip and SenseCam (see Fig. 2(b) and (d)), are photographic
cameras, which have Low Temporal Resolution (LTR) (2-3
frames per minute), and hence are more suitable for acquiring
data over long periods of time. On the one hand, data recorded
at specific moments with video cameras offer potential for
in-depth analysis of daily or special activities, making it
possible to capture even how something happened. On the other
hand, data acquired over long periods of time, commonly called
visual lifelogs, offer considerable potential for inferring
knowledge about e.g. behaviour patterns, and hence enable many
applications that would not be possible with HTR cameras. As
shown by Doherty et al. [32], visual lifelogs captured through
a SenseCam, which as opposed to video cameras can capture
the whole day, could be used to prevent non-communicable
diseases associated with unhealthy trends and risky profiles
(such as obesity or depression, among others). Additionally,
they could also help prevent cognitive and functional decline
in elderly people [29], [?], [?]. However, visual lifelogs
present a significant challenge for automatic visual analysis.
Indeed, due to the free motion of the camera and to its
LTR, abrupt changes in lighting conditions and image content
are very frequent (see Fig. 1). In such situations, computer
vision techniques based on temporal coherence and motion
estimation become unreliable. Recognition algorithms have to
cope with the huge variety of objects that appear. In addition,
due to the non-intentional nature of the pictures captured, they
generally contain severely occluded objects, artefacts such as
blurring or light saturation [?], and a large number of
non-informative images that capture non-meaningful information
such as walls, the sky, parts of objects, etc. Furthermore, the
sheer volume of data that a visual lifelog comprises and the
rate at which it grows (up to 2,000 images per day or
around 800,000 images every year) impose a need for efficient
methods to extract and locate relevant content concerning the
wearer from the photo stream. Regarding HTR cameras, if
they were employed for lifelog analysis, the problem of
the amount of data would be even more acute, and would
additionally imply the need for huge computational resources.

Fig. 1. Example of a sequence acquired by the Narrative Clip wearable
camera while the user is walking down a street. The temporal leaps between
neighbouring pictures produced by photographic cameras are common in
dynamic environments and make the extraction of information from closely
spaced images very difficult.

Fig. 2. Examples of wearable cameras on the market: (a) GoPro (2002), (b)
SenseCam (2005), (c) Looxcie (2011), (d) Narrative Clip (2013).
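
To make the difference in data volume between LTR and HTR cameras concrete, the following minimal Python sketch computes the number of frames captured per day from the frame rates quoted above. The 12-hour daily wearing time and the helper name `frames_per_day` are assumptions made purely for illustration; they are not taken from the surveyed work.

```python
def frames_per_day(frames_per_minute: int, hours_worn: int) -> int:
    """Number of frames captured during one day of recording."""
    return frames_per_minute * 60 * hours_worn


HOURS_WORN = 12  # assumed daily wearing time (illustrative, not from the text)

# LTR photographic cameras: 2-3 frames per minute.
ltr_low = frames_per_day(2, HOURS_WORN)
ltr_high = frames_per_day(3, HOURS_WORN)

# HTR video cameras: 25-60 frames per second, i.e. 1,500-3,600 per minute.
htr_low = frames_per_day(25 * 60, HOURS_WORN)
htr_high = frames_per_day(60 * 60, HOURS_WORN)

print(f"LTR: {ltr_low:,} to {ltr_high:,} frames per day")
print(f"HTR: {htr_low:,} to {htr_high:,} frames per day")
```

Under this assumption, an LTR camera yields between 1,440 and 2,160 images per day, which is in line with the figure of up to 2,000 images quoted above, whereas an HTR camera would yield more than a million frames over the same period.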

In response to the challenges and opportunities introduced 
by analysis of visual lifelogs, and more generally, by wearable 
cameras, computer vision scientists have rapidly become more
interested in the subject over recent years. By searching
for the keywords egocentric vision, first person vision, ego
vision and visual lifelogging, using Google Scholar, DBLP
and visionbib.com, we found 274 papers in total devoted to
visual lifelogging. For each of them, we annotated the type
of camera used in the study and generated the plot in Fig. 3,
which represents all the papers related to egocentric vision up
to November 2015. As can be seen, interest has grown very fast in
recent years and the number of papers published increased by
over 50% in 2014 alone. The dashed lines show the comparatively
small amount of work devoted to the analysis of image streams
captured by photo cameras. This trend seemed to change
temporarily from 2007 to 2010, when the popularity of SenseCam
resulted in a growth in the use of photo streams.

An additional indication of the interest in this emerging
field is the fact that, in recent years, four surveys of
wearable cameras and egocentric vision have been published.
The first, written by Doherty et al. [?], focuses on explaining
the ethical and data management issues that must be taken
into account when developing health-related applications
using wearable cameras. The second, by Betancourt et
al. [?], provides a general perspective on egocentric vision
and devotes most of its analysis to the egocentric camera
hardware, egocentric datasets, augmented reality, algorithm
types and feature types used in the literature from 1997 to
2014. This analysis is focused on providing a historical
perspective of egocentric devices and their algorithms in addition
to several ways of categorizing the existing papers in this
field. The third, which is a book by Gurrin et al. [?],
focuses on data management and distinguishes between data
storage, organisation and visualisation, while also providing
an overview of potential applications. In the fourth, by
Harvey et al. [?], the authors present their work from the
perspective of providing an aid to human memory. They
analyse human memory mechanisms from a psychological
perspective and propose a pipeline for enhancing memory based on
segmentation, context enhancement (recognising objects and
people) and image retrieval.

This paper focuses on addressing the question: How far
are we from being able to automatically tell our stories using
egocentric photo streams? The process of fully understanding
the story behind the pictures is fundamental to enabling a
wide range of applications [?] and use cases [?], especially
those related to health. As explained above, since these applications
require observations over long periods of time, data should be
acquired by photographic cameras (e.g. SenseCam or Narrative)
instead of video cameras (e.g. GoPro, Google Glass or
Looxcie). To this end, a thorough review of the published
advances in egocentric data analysis is presented and research
insights are provided. In contrast to previous surveys, we
review and give details of studies that focus on both
photographic and video cameras, considering which aspects should
be reformulated and modified for their applicability in the LTR
domain, and thus for egocentric storytelling.

Fig. 3. Histogram of the number of research papers published per year related
to egocentric vision. The different colours indicate how many papers used
each kind of camera. The dashed blue and black lines make a less specific
distinction, showing the number of studies that used photo (LTR) or video
(HTR) cameras, respectively.

To summarize, our contributions are as follows: 

• Review of methods for acquiring, organizing, summarizing
and browsing large collections of unstructured data.

• Organization of the available literature around the central
questions necessary to address the storytelling problem:
Was the user interacting with somebody? How? Where
is he/she? When did the event occur? and What is the
person wearing the camera doing?

• Highlights of the weaknesses and strengths of the reviewed
techniques with respect to their applicability to
the LTR domain (at the end of each subsection).

• Extensive analysis of the available datasets and source 
code related to the storytelling problems. 

• Open problems and challenges in the field of egocentric 
vision with the final goal of storytelling. 

The rest of the paper is organised as follows. In Section
II we review the most important papers devoted to the task
of acquiring, organising, summarizing and browsing large and
unstructured collections of egocentric data. The solutions to
these problems provide a basis to further analyse the data
content, as in Section III, where we review papers that claim to
construct semantic building blocks for storytelling. Concluding
remarks about applicability to the LTR domain are given at the
end of each subsection. In Section IV we summarize the available
egocentric datasets with the corresponding annotations,
as well as the egocentric vision software. Finally, in Section
V we draw our conclusions and give some possible future
directions for the research necessary to fill the gap between
raw egocentric data analysis and visual storytelling.

II. Visual Lifelogging Acquisition, Segmentation and Summarization

This section reviews the literature concerning acquiring,
structuring and summarizing visual lifelogging data, which is
summarized in Table I.

A. Data Acquisition 

The positioning of a wearable camera is of crucial importance
for lifelogging data acquisition from the point of view
of its later application. Mayol-Cuevas et al. [66] evaluated,
partially through simulations on a 3D facet model of the
human body, four attributes of optical devices with respect
to their position on the wearer’s body: social acceptability,
absolute field of view (FOV), resilience to body motion, and
view of the handling space region. That study concluded that
wearable cameras placed on the chest are the most socially
acceptable and therefore offer the advantage of not interfering
with social interactions. In addition, they are relatively resilient
to the disturbances introduced by the wearer’s own motion
and are closely linked to the user’s workspace, since they
allow visualisation of the manipulative space in front of the
wearer’s chest. However, the FOV is quite narrow and does
not allow the focus of the wearer’s attention to be modelled.
In contrast, cameras worn on the head have a wider FOV and
do allow this attention to be modelled, but they are the most
sensitive to the wearer’s motion and suffer from low social
acceptability. A compromise between the size of the FOV,
accessibility to the handling regions, sensitivity to ego-motion
and social acceptability is offered by wearable cameras placed
on the shoulder. The authors also considered the possibility of
wearing multiple devices on different parts of the body so
that their FOVs would be complementary, with the joint FOV
computed as the union of the individual FOVs.

TABLE I

Summary of all the visual lifelogging papers reviewed in this
survey related to acquiring, organizing, summarizing and
browsing large collections of unstructured data.

Remarks: Since social acceptability is crucial for long-term
image acquisition, placement on the chest is usually
considered the best choice. In addition, it has the advantage
of offering a view of the handling space, so that the
manipulation of objects can be observed.

B. Informative Image Detection 

Once images have been acquired, before proceeding with 
any structuring, analysis and summarization, proper cleaning 
of the images is necessary. This need stems from the fact 
that egocentric images are non-intentional images, that is,
nobody decides when to take a picture or what to photograph. As a
result, a significant number of images can be blurred, can be
dark, or can capture non-informative data (the sky, the ground,
walls, etc.). In Xiong and Grauman [?], informative images
are defined as "intentional" images, obtained once those with
undesired artefacts, such as light saturation, blurring, or
useless information (the sky, walls, etc.) have been removed.
Lidon et al. [?] define as informative any image that includes
objects and/or people, and which is of reasonable quality,
assuming that it does not include any undesired artefacts (e.g.
blurring, darkness or occlusions). With this definition, they
trained a binary CNN to make this distinction.

Fig. 4. Example of the desired event segmentation applied to lifelogging
data (events shown: Shopping, Working, Ride a bike, Dinner time). The goal
is to group the images with respect to their main event, considering the
activities, objects or people involved.
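
As a rough illustration of the image-cleaning step discussed in this subsection, the sketch below rejects images that are too dark, saturated or blurred using two hand-crafted measures: mean brightness and the variance of a 4-neighbour Laplacian. This is a deliberately simple heuristic stand-in and not the binary CNN described above; the function names and all thresholds are hypothetical, and images are assumed to be grayscale 2D lists with values in 0-255.

```python
from statistics import mean, pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian; low values suggest blur or flat content."""
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def is_informative(gray, min_brightness=30.0, max_brightness=225.0,
                   min_sharpness=50.0):
    """Heuristic filter; the three thresholds are illustrative assumptions."""
    brightness = mean(value for row in gray for value in row)
    if not (min_brightness <= brightness <= max_brightness):
        return False  # too dark or saturated
    return laplacian_variance(gray) >= min_sharpness


# Tiny synthetic 8x8 examples (illustrative only).
sharp = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]  # high contrast
flat = [[128] * 8 for _ in range(8)]                                 # e.g. a blank wall
dark = [[5 * ((x + y) % 2) for x in range(8)] for y in range(8)]     # under-exposed

print(is_informative(sharp), is_informative(flat), is_informative(dark))
```

A learned classifier such as the CNN mentioned above can capture semantic notions of informativeness (presence of objects or people) that these low-level measures cannot.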

C. Temporal Segmentation 

Lifelogging data typically consist of long unstructured 
videos or photo streams. Organising and structuring them into 
homogeneous temporal segments, corresponding to different 
events and/or environments (see Fig. 4), is very important
to facilitate browsing and analysis of the images. State-of-the-
art methods for egocentric data segmentation can be classified
into two broad classes depending on whether the homogeneous
segments represent what the wearer sees or does.

The former class uses features that can capture the
characteristics of the environment around the wearer as image
representation. Early work aiming at segmenting the sequences
into visually homogeneous segments was based on low-level
features. Li et al. [60] showed that it is possible to
distinguish different events simply by treating SenseCam images
as time-series data and calculating the eigenvalue peaks in
consecutive windows of images. Doherty et al. [?], [28] used
different descriptors for image representation and the metadata
available from the camera sensors. Lin and Hauptmann [?]
proposed a simple approach based on using colour features
in a time-constrained K-means clustering algorithm, capable
of maintaining temporal coherence in the splitting of events.
Spriggs et al. [?] proposed a method for simultaneous
temporal segmentation and recognition of activity related
to cooking. They captured videos at the same time from
a single wearable video camera and multiple other static
cameras, sensors, microphones, etc., and used both sensor
data and visual GIST descriptors to describe the frames. For
the unsupervised scene segmentation, they applied a Gaussian
mixture model. More recently, Talavera et al. [88] proposed
the use of CNN features computed on the whole image, with
AlexNet as a fixed feature extractor for image representation.
That work, designed for egocentric photo streams, uses a
graph-cut algorithm to temporally segment the photo streams,
combining an agglomerative clustering approach with a concept
drift detection method called ADWIN.
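
The idea of segmenting a stream while preserving temporal coherence can be sketched as follows. This is a simplified stand-in, not the time-constrained K-means or the graph-cut formulation cited above: it assumes each frame is already described by a feature vector and uses dynamic programming to split the time-ordered sequence into k contiguous segments that minimise the within-segment squared error. The function names and the toy one-dimensional features are hypothetical.

```python
def segment_cost(features, start, end):
    """Sum of squared distances to the centroid for features[start:end]."""
    chunk = features[start:end]
    dims = len(chunk[0])
    centroid = [sum(f[d] for f in chunk) / len(chunk) for d in range(dims)]
    return sum((f[d] - centroid[d]) ** 2 for f in chunk for d in range(dims))


def temporal_segments(features, k):
    """Split a time-ordered list of feature vectors into k contiguous segments."""
    n = len(features)
    inf = float("inf")
    # best[j][i]: minimal cost of splitting the first i frames into j segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):
                cost = best[j - 1][s] + segment_cost(features, s, i)
                if cost < best[j][i]:
                    best[j][i], cut[j][i] = cost, s
    # Walk the cut table backwards to recover (start, end) index pairs.
    bounds, i = [], n
    for j in range(k, 0, -1):
        s = cut[j][i]
        bounds.append((s, i))
        i = s
    return bounds[::-1]


# Toy stream with three visually distinct events (illustrative values).
stream = [[0.0], [0.1], [0.0], [5.0], [5.1], [9.0], [9.2], [9.1]]
segments = temporal_segments(stream, k=3)
print(segments)
```

Because every segment is a contiguous index range, frames from different times of day can never be merged into the same event, which is the property that time-constrained methods are designed to guarantee.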

Methods focusing on what the camera wearer does mostly
use motion information as image representation. Usually,
optical flow is used to distinguish between static, moving the
head/camera and in-transit frames [?], [?] (see Fig. 5). To
focus on long-term ego-activities, Poleg et al. [?] proposed
the use of so-called integral motion, which is closely related
to the wearer’s activity. By integrating the instantaneous
displacements at fixed image patches, the variations due to head
rotation are eliminated, since their mean is practically zero,
leaving only the consistent displacement caused by forward
motion. A different approach, based on CNNs, is adopted by
Castro et al. [?]. They gathered a large egocentric dataset
from a single user and fine-tuned a CNN pre-trained on
ImageNet for activity classification. They showed that the
network trained on the data of a single user can be re-trained to
generalise to new users. The main problem with this approach
is that a new set (several thousands of images) must be labelled
from scratch whenever it is necessary to predict the events
affecting a new user with the model.

Fig. 5. Motion-based segmentation framework proposed in [?]. By including
motion features to describe egocentric pictures they can separate the events
considering the dynamism of the activities performed.
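
The intuition behind integral motion, as described above, can be illustrated with a minimal sketch: summing per-frame displacements over time cancels zero-mean head movements and leaves the steady drift caused by forward motion. The displacement values below are synthetic and the function name is hypothetical; this is not the cited implementation.

```python
from itertools import accumulate


def integral_motion(displacements):
    """Cumulative sum of per-frame displacements measured at one fixed image patch."""
    return list(accumulate(displacements))


# Synthetic horizontal displacements in pixels per frame (illustrative only).
head_turns = [3, -3, 3, -3, 3, -3, 3, -3]   # zero-mean oscillation, wearer not moving
walking = [d + 1 for d in head_turns]        # same oscillation plus a steady drift

static_total = integral_motion(head_turns)[-1]
walking_total = integral_motion(walking)[-1]
print(static_total, walking_total)
```

Although the instantaneous displacements of the two signals have similar magnitudes, only the walking sequence accumulates a non-zero integral, which is why the integrated signal is a more reliable cue for long-term ego-activity than the raw flow.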

• Remarks: The applicability of motion as a feature, though 
relevant when dealing with videos, has proven to be rather limited
for photo streams. In the latter case, the use of richer
representations, such as global CNN-based features, seems crucial
to compensate for this limitation. The use of time-dependent
methods for egocentric segmentation is also essential considering
the nature of the data. A promising approach to improve 
the results of the segmentation of egocentric sequences is 
the addition of semantic-level features (scenes, objects, social 
interaction, actions, etc.). This additional information would 
be an important step to bring machine segmentation closer to 
the way humans segment unconstrained streams of images. 


D. Egocentric Summarization 

Summarization is the process of generating a proper, compact
and meaningful representation [90] of a given sequence
through a subset of representative frames or segments. This
step is crucial to help manage and browse large volumes of
lifelogging video content efficiently. Basically, there are two
kinds of summaries that can be produced: a static video story
board, which is composed of a set of salient images extracted
or synthesised from the original sequence, and a dynamic video
skim, which is a shorter version of the original video made up of
several shots, each comprising a series of frames. To fully exploit
the potential of visual lifelogs in a variety of applications, an 
egocentric summarization method should be designed to aid in 
the visualisation, indexing and browsing of autobiographical 
events, with the least possible semantic loss. 

Story board summarization has traditionally been formulated
as grouping images into coherent collections by relying
on low-level spatio-temporal features and then selecting the
most representative image (or set of images) from each
collection [?]. Based on this classical approach, Jinda-Apiraksa
et al. [49] and Chowdhury et al. [?] developed similar
techniques for keyframe selection in egocentric sequences based
on quality measures [49], [25] and on both quality and diversity
measures [?]. More complex features for grouping were used
by Bolanos et al. [?]. Their methodology, adapted for photo
cameras, uses the AlexNet CNN as a feature extractor to
characterise each frame. Then, using those features, they apply
event segmentation using a hierarchical clustering algorithm
and subsequently select a single keyframe from each segment by
applying the Random Walk algorithm.

While these methods rely solely on low-level features, 
some recent work has introduced a semantic level in the 
keyframe selection process. Ghosh et al. [?] suggested that
video summarization should be driven by the presence of
important people and objects. Following this idea, they
proposed a method that reveals salient people and objects based
on their interaction time with the camera wearer and then
selects keyframes according to key object event occurrences.
Lu and Grauman [64], following on from their previous
work, suggested that video summarization should preserve the 
narrative character of a visual lifelog and proposed a shot 
selection consisting of three terms: 1) a term that models story 
coherence by favouring shots capable of following the inherent 
story; 2) a term that models importance, to choose only shots 
that show some important aspect of the day; and 3) a term 
that models diversity and avoids repeating similar events. 

Summarization that considers semantic topics was recently
proposed by Varini et al. [?] and Schinasi et al. [?]. In the
former, it is assumed that interesting scenes in a cultural experience,
such as visiting a museum, are those associated with certain
patterns of behaviour of the camera wearer that are learned
and used for classification. Taking into account the topic of
interest of the user, different summaries can be generated from
the same video. In the latter, topics are revealed from a set of social
media messages as highly connected messages in a graph,
whose nodes encode messages and whose edges encode their
similarities. Finally, the images that best represent the topic
are selected based on their relevance and diversity. Lidon et
al. [?], also working on photo sequences, proposed an event
keyframe ranking method based on a trade-off between image
relevance and diversity after removing non-informative images
(containing undesired artefacts, e.g. blurring, darkness or
occlusion, or showing the sky, walls or object parts) by using
a new binary CNN-based filter. Their relevance criteria took
into consideration several semantic measurements, including
whether faces and/or objects were present, as well as whether
the images had a high saliency value.
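
A trade-off between relevance and diversity of the kind mentioned above can be sketched with a greedy ranking in which each candidate is rewarded for its relevance and penalised for its similarity to frames already selected. This is a generic heuristic written for illustration, not the formulation of any of the cited methods; the relevance scores, feature vectors, `trade_off` weight and function names are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_keyframes(features, relevance, trade_off=0.5):
    """Greedy ranking balancing relevance against redundancy with chosen frames."""
    remaining = list(range(len(features)))
    ranked = []
    while remaining:
        def score(i):
            redundancy = max(
                (cosine_similarity(features[i], features[j]) for j in ranked),
                default=0.0,
            )
            return trade_off * relevance[i] - (1 - trade_off) * redundancy

        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Frames 0 and 1 are near-duplicates; frame 2 is different but less relevant.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
relevance = [0.9, 0.85, 0.6]
ranking = rank_keyframes(features, relevance)
print(ranking)
```

In this toy example the near-duplicate frame is pushed to the end of the ranking even though its relevance score is the second highest, which is the behaviour a diversity term is meant to produce.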

• Remarks: A semantic-oriented approach to egocentric 
summarization seems to be the most suitable for lifelogging 
data. Indeed, users would ideally search for complex
autobiographical events that encompass simpler human actions and
may not be directly correlated with their visual appearance.
When dealing with photographic cameras, and due to the
nature of their data, the only possible way to tackle the
summarization problem is through the keyframe selection
approach. Taking this into account, methods like [64] should be
reformulated, either by considering the video sub-shots as single
frames, or by developing a fine-grained segmentation procedure.
This procedure should separate the data into a large number of
events to have enough segments to apply the sub-shot selection
correctly.

E. Content-Based Search and Retrieval 

Retrieving images from a large personal database allows us 
to browse, search and find images of previously seen objects 
or places and thereby has the potential to solve a broad range 
of problems in egocentric vision, such as: 

• searching for elements (Have I seen this before?); 

• navigating (How often do I visit this place?); 

• understanding the environment (Where am I right now?); 

• efficiently organising huge amounts of data. 

Following these premises, Wang et al. [94] built a
system for content-based searching and browsing that starts
by splitting the stored data into segments and extracting three
kinds of information: 1) time and other relevant attributes,
2) low-level visual features, and 3) audio features. Then, in the
retrieval step, they applied time-based filtering by comparing 
the time attributes of the images in the database with the 
query introduced by the user. A clustering step then extracts a 
representative clip from each cluster; and finally, the user can 
provide one or more query images for the system to refine 
the search based on visual features and improve the query 
result. Still, several open issues remain: in many situations it is
difficult to recall when and where the photo we are looking
for was taken; the visual features are too simple to capture real
object shape and texture differences; and furthermore, audio
features are not provided by all wearable devices. Aghazadeh
et al. [?] proposed to retrieve novel scenes and actions with
respect to a previously acquired egocentric dataset by using a
set of "alignment" sequences, and matching them with a new
"query" sequence by using dynamic time warping.
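
Dynamic time warping, mentioned above as the matching step, aligns two sequences that follow the same course at different speeds. The sketch below is the textbook algorithm applied to scalar per-frame descriptors; the descriptors and the example sequences are hypothetical, and the cited work operates on richer image features.

```python
def dtw_distance(query, reference, dist=lambda a, b: abs(a - b)):
    """Classic dynamic time warping cost between two sequences."""
    n, m = len(query), len(reference)
    inf = float("inf")
    # cost[i][j]: minimal alignment cost of query[:i] with reference[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(query[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # query frame repeats
                                 cost[i][j - 1],      # reference frame repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Scalar per-frame descriptors (illustrative values).
reference = [0, 1, 2, 3, 2, 1, 0]
same_route_slower = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]
with_novel_part = [0, 1, 5, 5, 5, 1, 0]

print(dtw_distance(same_route_slower, reference))
print(dtw_distance(with_novel_part, reference))
```

A query that merely traverses the known sequence more slowly aligns at zero cost, whereas frames with no counterpart in the reference raise the cost, which is what makes the alignment residual usable as a novelty cue.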

Assuming that searching, browsing or summarization in
visual lifelogging would largely benefit from semantic concept
representation, Wang and Smeaton [93] investigated the
selection of the most appropriate combination of concepts for event
representation. Their strategy basically consists of reasoning
on semantic networks using a density-based approach. Min
et al. [24], [?] represented millions of egocentric images on
a sparse graph. They represented each image as a node in
the graph, and added an edge between two nodes when they
belonged to the same bag in a BoW representation. Relying on
this representation, they showed that local density clustering
is more suitable than global clustering methods, considering
the high redundancy that lifelogging data inherently possess.
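
One possible reading of the sparse graph construction described above is sketched below: every image is a node, and two images are linked when they share at least one visual word of a bag-of-words vocabulary. This interpretation, the inverted-index construction and all names are assumptions made for illustration; the cited work may define bags and edges differently.

```python
from collections import defaultdict
from itertools import combinations


def build_sparse_graph(image_words):
    """image_words maps an image id to its set of visual-word ids.

    Returns adjacency sets linking images that share at least one visual word.
    """
    inverted = defaultdict(set)  # visual word -> images containing it
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    graph = {image_id: set() for image_id in image_words}
    for images in inverted.values():
        for a, b in combinations(sorted(images), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical visual-word assignments for four images.
image_words = {
    "img0": {1, 2},
    "img1": {2, 3},
    "img2": {3},
    "img3": {9},
}
graph = build_sparse_graph(image_words)
print({image_id: sorted(neighbours) for image_id, neighbours in graph.items()})
```

Only images that actually share content are connected, so the number of edges stays far below the quadratic number of image pairs, and the node degree gives a simple local density measure on which clustering can operate.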

• Remarks: Many issues remain regarding content-based 
retrieval techniques, for instance: How can we make use of 
the basic building blocks extracted from lifelogging (actions, 
people and environments)? The use of multi-level and
multi-modal descriptions based on the recognition of actions,
people, objects and environments could provide a detailed
image description close to text level, which could allow high
retrieval accuracy.

In methods such as [?], [?], new challenges would arise
when dealing with photo data, considering the higher variability
of consecutive images compared to video sequences.


III. Visual Lifelogging Analysis

We present an overview of the most important papers on 
visual lifelogging analysis and the problems they tackled, 
organised around four basic questions: Is the user interacting?
How? Where is the user? When are the events occurring? and
What is the user doing? Table II lists the papers and related
information. 


TABLE II

Summary of all the visual lifelogging analysis-related papers
reviewed.

A. Interacting? How?: Social Interactions

Following the definition by Rummel [?], social interactions
are all acts, actions or practices of two or more people
mutually oriented towards each other. Given the powerful
social nature of humans, the analysis of social interactions in
lifelogging data is of fundamental importance to understanding
human behaviour. Furthermore, the presence of people and
social interactions is consistently associated with event
memorability [12] and therefore, their detection is also potentially
useful for keyframe extraction or to estimate the importance
of events in a lifelog [?]. From the perspective of computer
vision, social interactions can be characterised by patterns
of attention between individuals. Analysing attention patterns
requires the detection, tracking and locating of people in
3D environments. Indeed, when interacting with others, we
naturally tend to place ourselves in certain positions so as
to stand close to those we interact with and avoid occlusions.
F-formations [?] have been demonstrated to be a suitable
formalism for modelling social interaction behaviour. Following
the original definition by Kendon [?]:

An F-formation arises whenever two or more people 
sustain a spatial and orientational relationship in 
which the space between them is one to which they 
have equal, direct, and exclusive access.

Fig. 6. Different arrangements of F-formations that are useful for social
interaction analysis: (a) circular arrangement, (b) vis-a-vis arrangement, (c)
L-arrangement, (d) side-by-side arrangement. Image adapted from [?].

Fig. 7. Example of multi-face tracking obtained by applying the method in
[?] to track multiple faces in LTR sequences captured by a wearable camera.
Each row represents the track of a different person.

Examples of F-formations are given in Fig. 6. The
F-formation theory has been successfully applied in social
interaction analysis [46] using classical videos or still images,
and more recently to egocentric videos [6]. Head pose estimation
and 3D location are crucial for the detection of F-formations.
Indeed, a rough estimate of someone’s head pose allows us to
understand with a certain precision what the person is looking
at, while the distances between each person and the camera
wearer, and between the people themselves, must also be
estimated to determine whether there is an interaction.

In sequences captured through a wearable camera, pose
estimation is a challenging task due to the continuous changes
of aspect ratio, scale and orientation. A common way to
address this problem [?], [?], [?], [?] is to assume that
when a group interacts in a discussion, the head of each
person will be oriented for a while towards the person who is
speaking, and to use a model to capture this behaviour over
time. Generally, in video sequences, this is achieved through
a hidden Markov model or Markov random fields, where the
latent variable corresponds to the head pose and the observed
variables to the results of a multiple person tracker applied
to the input images. The only works devoted to the analysis
of photo sequences are [?], [?], [?]. In this context, tracking
people is very challenging due to the abrupt and very frequent
changes of view. The proposed approach basically consists of
computing backward and forward correspondences for each
face detected in the sequence and of grouping similar tracklets
into bags, which should correspond to different people (see
Fig. 7). A combination of first-person and third-person views
is considered by Soo and Shi [?] to predict social saliency,
defined as the likelihood of joint attention, in real-world
scenes with multiple social groups. This is basically achieved
by modelling social formation features that encode the geometric
relation between the joint attention and spatial distribution
of the members of a social group.

• Remarks: In general, there is common agreement about
the need to track people, head orientation and 3D locations to
detect F-formations that represent social groups in egocentric
sequences; however, two fundamental problems arise. First,
since distances and poses can assume different degrees of
significance in different social scenarios, an algorithm needs
to be able to adapt to different situations and
learn how to treat distance and orientation features depending
on the context. As a consequence, the choice of which data
to use for training is crucial. Second, distances and poses
strongly depend on where the camera is worn (eyeglasses, on
the head, on the neck, etc.). Except for [?], [?], all the methods
mentioned above rely strongly on temporal coherence, since
they were conceived for video sequences. Further advances in
the analysis of social interactions through photographic
cameras would require us to focus on features that are less sensitive
to changes over time, such as people’s body movements, which
are consistently associated with emotional experiences [?]
and could, therefore, be considered cues of social interactions.

B. Where? Scene Understanding 

To answer the question “Where is the user?”, we require
a semantic understanding of the elements that surround the
camera wearer, such as objects, people and environments,
since they represent the cues available to recognise his/her
surroundings. In this section, we provide an overview of
computer vision tasks related to scene understanding, such
as object recognition, spatial localisation, scene parsing and
scene recognition. For each task, our goal is to determine
which techniques are most promising for understanding
scenes in lifelogging data.

1) Object Recognition and Object Discovery: Scenes can
also be characterised by the vocabulary of concepts that can
be found in them. With this aim, we consider two problems:
object recognition, which aims to identify the category that
a given object belongs to; and object discovery, which detects,
recognises and reveals new objects that the algorithm may never
have seen in previous images. Due to the free motion of the
camera and to the passive acquisition of lifelogging data,
objects are frequently occluded and their appearance may vary
broadly. Thus, object recognition in egocentric data is a
challenging and active research field. The first work on object
recognition in the domain of lifelogging is by Byrne et al. GD,
who successfully validated supervised concept recognition,
referring to relevant objects or scenes as concepts.
Furthermore, using the output of the detector, they showed
that the images that compose a lifelog collection tend to be
temporally consistent in their visual properties, as well as in
the concepts they contain. Because of this concept consistency,
they suggested that an efficient automatic extraction and
inference of higher-level semantic concepts based on
co-occurrences and known relationships would be feasible.
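The temporal consistency of concepts described above suggests a simple post-processing step. The following is a minimal sketch under stated assumptions, not the method of the cited work: it assumes one concept label per image, and the names `smooth_labels` and `detections` are hypothetical. Each label is replaced by the majority label among its temporal neighbours, so that isolated detections that disagree with their surroundings are corrected.

```python
from collections import Counter


def smooth_labels(labels, radius=1):
    """Majority vote over a window of 2 * radius + 1 consecutive images."""
    smoothed = []
    for i, original in enumerate(labels):
        # Neighbouring images, clipped at the start and end of the sequence.
        window = labels[max(0, i - radius): i + radius + 1]
        counts = Counter(window)
        top = max(counts.values())
        winners = [label for label, count in counts.items() if count == top]
        # Keep the detector's own label on ties; otherwise take the majority.
        smoothed.append(original if original in winners else winners[0])
    return smoothed


# Hypothetical per-image detections with one isolated outlier.
detections = ["office", "office", "street", "office", "office"]
print(smooth_labels(detections))
```

A larger `radius` enforces stronger consistency, at the cost of blurring genuine scene changes.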










JOURNAL OF TRANSACTIONS ON HUMAN-MACHINE SYSTEMS JULY 2015 

Bolanos et al. uni developed an active labelling method to
generate a sufficiently large number of training examples to
train an efficient supervised classifier. The method, based
on a combination of hierarchical clustering trees, uses an
unsupervised learning algorithm to organise the data, selects
the most informative part of it, asks the user for the
corresponding labels, and uses the feedback provided to improve
the classification in a semi-supervised way. Ren et al. in
[75], [74] and Fathi et al. used head-mounted cameras and
proposed methods that recognise objects held in the user’s
hand. They segmented the foreground (hands and objects) from
the background using optical flow features, relying on the fact
that foreground objects usually move in a more dynamic way
while the background is more static.
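The motion cue described above can be illustrated with a small sketch. This is a simplification under stated assumptions and not the segmentation method of the cited works: the optical flow is assumed to be already computed and given as a 2D grid of `(dx, dy)` vectors, the dominant (background) motion is approximated by the per-component median, and `foreground_mask` and `threshold` are hypothetical names.

```python
import math
from statistics import median


def foreground_mask(flow, threshold=1.0):
    """Mark pixels whose motion deviates from the dominant motion.

    flow: 2D list of (dx, dy) tuples, one per pixel.
    Returns a 2D list of booleans (True = foreground candidate).
    """
    # Approximate the background motion by the median flow vector.
    bg_dx = median(dx for row in flow for dx, _ in row)
    bg_dy = median(dy for row in flow for _, dy in row)
    return [
        [math.hypot(dx - bg_dx, dy - bg_dy) > threshold for dx, dy in row]
        for row in flow
    ]


# Hypothetical 3 x 3 flow field: static background, one moving pixel.
flow = [[(0, 0)] * 3 for _ in range(3)]
flow[1][1] = (3, 4)
mask = foreground_mask(flow)
print(mask[1][1], sum(cell for row in mask for cell in row))
```

Subtracting the median rather than thresholding the raw magnitude keeps the sketch usable when the whole frame moves with the wearer’s head.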


Fig. 8. Examples of objects revealed by the ego-object discovery methodology 
fTbl for two different subjects (one per row). Better viewed in digital format. 

Focusing on the task of object discovery in lifelogging data
(see example in Fig. 8), Kang et al. 1(5^ proposed a method
that clusters only samples with higher correlation, which should
belong to the same object type. To this end, starting from an
initial segmentation, they provide a merging strategy for
segments that closely co-occur in most images. In this way, they
complete objects that might be composed of different, but
clearly defined parts (e.g. a laptop composed of a screen and a
keyboard). With the same goal, Bolanos et al. in HU, ca proposed
the use of a state-of-the-art objectness detector and a
pre-trained CNN specialised in object recognition to extract a
set of rich features for each object candidate, followed by
clustering of the candidates. The clustering integrates a “Bag
of Refill” strategy of previously discovered object instances
as a knowledge reuse mechanism.


Fig. 9. Example of the result obtained (top) by applying a scene parsing 
algorithm to a conventional non-egocentric image (bottom). We can see the 
different segments found (separated by different colours) and the classes 
assigned to each of them. Picture adapted from [33].

2) Spatial Localisation: Bettadapura et al. in Ha proposed
a method called FOV localisation that combines localisation
techniques with egocentric images to localise the user(s) in the
environment. To do so, they used a reference dataset, which
can consist of images from Google Street View or pre-recorded
videos from fixed cameras, and matched it to the data acquired
by the user’s photographic or video camera to obtain his/her
location. They tested the system on multiple datasets
captured indoors and outdoors. Additionally, they proposed
a combined FOV localisation system for the simultaneous
localisation of multiple users of wearable devices. Wannous et
al. in [95] also proposed a methodology for localisation and
action-related event recognition. They used a shoulder-mounted
video camera to acquire images of daily indoor living (e.g.
kitchen, office, library, etc.) and built a 3D model of the
different scenes. In their work, they showed that their models
were more powerful than simpler 2D ones and were able to recover
information from previously seen scenes with query images.
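Matching a query image against a reference dataset can be sketched as a nearest-neighbour search. The example below is only an illustration under stated assumptions and not the pipeline of the cited works: each image is assumed to be already summarised by a fixed-length descriptor, and the `localise` function, the location names and the descriptor values are hypothetical.

```python
import math


def localise(query, reference):
    """Return the location whose reference descriptor is closest to the query.

    query: sequence of floats describing the wearer's current image.
    reference: list of (location, descriptor) pairs from the reference set.
    """
    location, _ = min(reference, key=lambda item: math.dist(query, item[1]))
    return location


# Hypothetical reference set with two known locations.
reference = [
    ("kitchen", [0.0, 0.0]),
    ("office", [1.0, 1.0]),
]
print(localise([0.9, 1.2], reference))
```

In practice the descriptors would come from a visual feature extractor, and the search would be restricted to geometrically plausible candidates.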

• Remarks: Another interesting approach that egocentric
vision could benefit from is scene parsing. This is based on
image segmentation; that is, separating out all the regions in an
image that belong to different objects. Furthermore, these
techniques classically provide a pixel-level segmentation of the
whole image and at the same time assign an object class to each
of the pixels (see the example of scene parsing in Fig. 9). To do
this, most of the methods use pixel-level classifiers to achieve
an initial segmentation and then apply a graphical model to
smooth and correct the boundaries of the segments (3^, [98].
Only a limited amount of work in this field can be found in the
literature, and none of it was specifically designed for or
tested on egocentric and lifelogging datasets. Considering the
differences between an egocentric dataset (and more precisely a
lifelog) and the datasets typically used in scene parsing, we
can enumerate some clear points to take into account when
working on scene parsing:

• Scene parsing datasets are usually composed of natural
and urban scenes (in general, outdoors) and their
corresponding class distributions have a high percentage of
training samples related to those environments. Egocentric
lifelogging datasets for scene parsing would be very
different, given the indoor and routine settings where
people usually spend most of their time.

• Since egocentric vision datasets are composed of routine
and redundant scenes, scene parsing methods focusing on
lifelogging images should provide some higher-level context
and knowledge-reuse mechanisms to take advantage of the
previously parsed images in the egocentric sequence.

Related to scene parsing, it would also be useful to be able
to recognise the scene the user is in. Although no work has
been presented with this purpose using egocentric data, a good
example with conventional images is the dataset Places205
[100]. Scene labels could help when deciding, for instance,
how to segment the day into events, or could be used to
exploit environment-object relationships.

Although good methodologies have been proposed for object
recognition and object discovery using egocentric and
lifelogging images, there is still a lot of work to do to
semantically describe the camera wearer’s environment at
a high level. The development of object detection methods
specifically designed for egocentric images could not only
improve existing recognition and discovery methods, but also
set a more robust basis for the future scene parsing of
lifelogging images. To achieve these goals, new computer
vision techniques able to cope with blurring, light
saturation and the occlusion of objects have to be developed.
Likewise, new techniques for gathering huge labelled datasets,
not only for object detection but most importantly for scene
parsing, must be developed. Furthermore, the addition of
GPS or visual localisation techniques to scene parsing could
clearly improve understanding of the environment. The most
promising technique applicable to scene parsing is the use of
Fully Convolutional Networks [63], which are able to infer the
class of each pixel by treating the image as a whole instead
of relying on the current pixel-centred classifications.
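The final labelling step of a dense, per-pixel predictor can be sketched as follows. This is an illustration under stated assumptions, not an implementation of a Fully Convolutional Network: the per-class score maps are assumed to be already produced by some network for the whole image, and `label_map`, the class names and the scores are hypothetical.

```python
def label_map(scores):
    """Assign to every pixel the class with the highest score.

    scores: dict mapping a class name to a 2D list of scores; all maps
    must have the same height and width.
    Returns a 2D list of class names.
    """
    classes = list(scores)
    first = scores[classes[0]]
    height, width = len(first), len(first[0])
    return [
        [max(classes, key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 1 x 2 image with two candidate classes.
scores = {
    "floor": [[0.9, 0.2]],
    "wall": [[0.1, 0.8]],
}
print(label_map(scores))
```

Because every score map covers the whole image, each pixel’s label can reflect global context, which is the property highlighted in the text.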

Finally, note that all the work on object recognition relies
on the user-like focus and point of view that head-mounted
cameras offer. This approach would not be feasible for real
applications, where neck-hanging cameras are usually used
because they are considered less obtrusive and more
user-friendly [43], despite not always being able to show what
the user is doing. Moreover, these algorithms, which rely
heavily on temporally close video frames and motion
information, would not be applicable to LTR photographic
cameras either.

C. When? Time-Based Localisation 

Time information is particularly important to determine the
causal relations in human behaviour. For instance, it could
be useful in understanding which factors trigger crises
in people affected by bipolar disorder. The most common
annotation used for keeping a record of the time in
lifelogging data is the time stamp provided by cameras. By
using this information, one can easily establish the temporal
placement of the data in the long term, as well as the order of
the images and their temporal distance in the short term (for
photographic cameras). Some works have studied incorporating
temporal information as a complementary feature for achieving
an indirect prediction. As an example, in 16^ the authors
treated the acquired data as a time series to properly segment
the different events present in a day. In 1^, both the day of
the week and the time of the day were used for training a
classifier able to categorise different events. Naaman et al.
[69] studied the role of the time stamp as a memory cue in a
psychological experiment on conventional images and concluded
that people are unable to retrieve their memories when only
given the time and date; consequently, additional information
is needed for retrieval methods to be effective.
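The two uses of time stamps mentioned above can be sketched in a few lines. The example is a minimal illustration under stated assumptions and does not reproduce the cited methods: events are split wherever the gap between consecutive images exceeds a threshold (a simplification of time-series segmentation), and `time_features`, `split_on_gaps` and the five-minute `max_gap` are hypothetical choices.

```python
from datetime import datetime, timedelta


def time_features(stamp):
    """Day of the week (Monday = 0) and hour, usable as classifier inputs."""
    return (stamp.weekday(), stamp.hour)


def split_on_gaps(stamps, max_gap=timedelta(minutes=5)):
    """Group time stamps into segments separated by long pauses."""
    segments = []
    for stamp in sorted(stamps):
        if segments and stamp - segments[-1][-1] <= max_gap:
            segments[-1].append(stamp)
        else:
            segments.append([stamp])
    return segments


# Hypothetical time stamps of three lifelog images.
stamps = [
    datetime(2015, 7, 1, 9, 0),
    datetime(2015, 7, 1, 9, 1),
    datetime(2015, 7, 1, 9, 30),
]
print([len(segment) for segment in split_on_gaps(stamps)])
print(time_features(stamps[0]))
```

Time alone is a weak cue, as the memory experiment cited above indicates, so such features are best combined with visual ones.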

D. What? Action Recognition 

Inferring what the camera wearer is doing from a visual
lifelog basically requires the categorisation of everyday
activities. The categories to focus on depend on the kind of
application. For instance, in healthcare and well-being
applications, occupational therapy research may guide the
selection of the target activities and related concepts (see
Fig. 10 for an example of sports category recognition). For
diet monitoring applications, eating actions will be the focus;
whereas in applications related to the diagnosis of dementia,
the focus will be on daily life activities such as dressing,
making coffee and cooking. In quantified-self applications,
activities like housework, watching TV, working/studying and
eating/drinking are the most prevalent.

Traditional action recognition methods can be broadly
classified depending on the kind of features they use to
represent actions, with body movement analysis and the use of
the objects involved in the action being the most common
choices. Only very recently has the scene context been used to
improve action recognition. Still, the choice of the
representation strongly depends on the kind of actions to be
classified.



Fig. 10. Examples of first-person point-of-view images taken while performing
various sports. Image adapted from ESI

1) Body movement-based methods: In an egocentric setting,
general body movements such as running, walking, moving the
head/camera or staying still are usually estimated relying on
motion features (when this is possible with the temporal
resolution of the camera). Ego-action classification based on
such features can also be used for event segmentation.
Typically, video cameras like GoPro, which capture around
30 fps, are used to gather data. Poleg et al. [73] proposed
integrating instantaneous displacements of fixed image patches
over a long period of time to remove the zero-mean variations
due to head rotation. This process leaves only the consistent
displacement caused by forward motion. The cumulative
displacement curves show different patterns for different
ego-motion activities, so that the activities become easy to
classify. Instead of focusing on building discriminative
motion features, Kitani et al. [56] used several modifications
of classical motion-based feature vectors and built a complex
Bayesian model for clustering.
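The effect of integrating displacements over time can be seen in a toy example. This is a sketch of the underlying idea only, with made-up numbers and a hypothetical helper name, and not the published algorithm: zero-mean variations cancel out in the running sum, whereas a consistent displacement keeps accumulating.

```python
from itertools import accumulate


def cumulative_displacement(displacements):
    """Running sum of per-frame displacements of one image patch."""
    return list(accumulate(displacements))


# Hypothetical per-frame displacements (in pixels) of a fixed patch.
head_rotation = [1, -1, 1, -1]          # zero-mean jitter
forward_motion = [1.5, 0.5, 1.5, 0.5]   # consistent drift plus jitter

print(cumulative_displacement(head_rotation)[-1])
print(cumulative_displacement(forward_motion)[-1])
```

The first curve returns to zero while the second grows steadily, which is what makes the two kinds of motion separable with a simple classifier on the curve shape.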



Fig. 11. Examples of first-person point of view images for recognising 
activities involving hands. The algorithm is capable of detecting the left and 
right hand of the user, in pink and light blue respectively; and the left and 
right hand of the person he/she is interacting with, in dark blue and green, 
respectively. Image adapted from (HI, devoted to hand disambiguation. 

2) Object-hand interaction-based methods: A first-person
point of view offers an ideal perspective from which to
analyse hand-object manipulation or hand-eye coordination
(see Fig. 11). The main idea, introduced by Fathi et al. Ill
and further improved by ifTOl, fTTIl, ISTll, is that objects are
correlated with actions (e.g. dish and nibbling) and actions
with activities, and these correlations can be exploited to
build robust object models. However, challenges arise
from additional occlusions (from manipulated objects, or
self-occlusions of fingers by the palm) and from the fact that
hands interact with the environment and often leave the camera
FOV.

Others have focused on different problems related to
hand-object manipulation, such as capturing the variability of
hand appearance over a diverse set of imaging conditions and
hand poses [59], disambiguating and tracking the observer’s
hands and those of social partners (5^, improving robustness
against camera motion [74], ca, di, or capturing the appearance
of visual composites of humans and objects in interaction llD.

3) Attention-based methods: The use of manipulation- 
based approaches is restricted to scenes and objects where 
the user’s hands present significant information. Attention- 
based approaches aim to identify objects to which the user 
pays particular attention, even in the absence of manipulation, 
since they could be key factors in self-behaviour recognition. 
In general, these methods are applicable to data acquired by 
head, eyeglass or ear-mounted cameras only. Attention can be
used to find salient objects, as in Matsuo et al. [65], or to
capture the relationship between action and gaze, as in (3^.

4) Other approaches: To detect activities that cannot be
fully characterised by body movement, object-hand
manipulation or object-gaze relationships, motion has been the
most commonly used feature. Instead of trying to compute
ego-motion, these approaches describe the frames that compose
the actions: they use a set of motion and visual word features
in a local (on a single frame) and global (on a set of
consecutive frames) manner and create a specific structure for
obtaining a temporally and spatially consistent representation
of the action. Song et al. [84] obtained an activity
recognition accuracy of about 80% on the dataset they published
(LENA dataset) by adopting the dense trajectory approach. In
( 23, the authors used a wearable video camera to capture and
recognise a diverse set of actions (e.g. throwing, hand shaking,
hugging or waving) which, in this case, are performed by other
people towards the camera user. Recently, a newer approach
for action recognition was proposed by the same authors in
[80]. On this occasion, they used CNN features to describe
the frames of an HTR video. To obtain a rich and motion-like
representation, they then proposed the use of a temporal
pooling operator (PoT). An interesting alternative to motion
was proposed by Yan et al. [97], who exploited the fact that
people typically tend to perform the same actions in the same
environment (e.g. people at work typically have a coffee
break); their results show the advantage of sharing information
between tasks. Kanade et al. (SD explored the problem of
activity recognition from a deeper perspective. They proposed
several methods for activity recognition, some based on object
and scene understanding, which are specifically adapted to
their eyeglass-mounted wearable device.
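The idea of pooling per-frame descriptors over time can be sketched as follows. This is an illustration under stated assumptions rather than the PoT operator itself: each frame is assumed to be already described by a feature vector (for instance, CNN activations), and the three pooled statistics (maximum, sum, and sum of positive frame-to-frame increases) as well as the name `pool_over_time` are choices made for this example.

```python
def pool_over_time(frames):
    """Summarise a clip by pooling each feature dimension over time.

    frames: list of equal-length feature vectors, one per frame.
    Returns a flat list with three values per feature dimension:
    maximum, sum, and total positive change between consecutive frames.
    """
    pooled = []
    # zip(*frames) yields one time series per feature dimension.
    for values in zip(*frames):
        increases = sum(max(b - a, 0) for a, b in zip(values, values[1:]))
        pooled.extend([max(values), sum(values), increases])
    return pooled


# Hypothetical clip: 3 frames, 2 features per frame.
clip = [[0, 2], [1, 1], [3, 0]]
print(pool_over_time(clip))
```

The third statistic depends on the order of the frames, which gives the pooled descriptor a motion-like character that plain maximum or sum pooling lacks.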

• Remarks: In essence, the most common cues on which
activity recognition in egocentric videos relies are body
movement, object-hand interaction and patterns of attention.
Body movement-based methods rely on motion estimation
and therefore are not directly applicable to data acquired by
photographic cameras. Object-hand interaction and patterns of
attention are feasible for data acquired by wearable cameras
attached to the head or somewhere near the person’s eyes
that can follow his/her gaze. However, when the camera
is worn as a necklace or attached to the clothes, it is often
impossible to see what the user is manipulating and very
difficult to estimate the centre of attention, so
attention-based methods fail. Similarly, object-hand methods
can be very difficult to apply considering the free motion of
the camera and the difficulty in regularly showing the hands of
the user. To the best of our knowledge, there is no published
work on recognition of egocentric activities recorded by freely
worn cameras. In this context, robust activity recognition
would need to take into account whether the camera wearer is
stationary or moving.

IV. Availability of datasets and software 

A. Egocentric Vision Datasets 

As egocentric vision is a relatively new research field,
the creation of standardised and sufficiently rich datasets and
annotations to test and compare new algorithms is crucial
to boost the development of the field. In Table III we provide
a summary of currently available public egocentric datasets,
specifying, for each of them, the following information: the
name and the reference paper where the dataset was presented
or used for the first time (where the data can be
found); a short description; the kind of annotated data it
contains; and the camera used to acquire the data.

Only two of the publicly available egocentric datasets,
EDUB (T^ and AIHS (5Ql, use photographic cameras and
are thus useful to test and compare algorithms for visual
lifelogging. Most of the others were acquired using video (HTR)
cameras, making the analysis of long periods of time difficult.
Although nearly all of them show scenes of daily living and
some of them record many continuous hours of video [64],
(ZD, (23, there is a strong need to create rich datasets with
detailed annotations to ensure the robustness, applicability and
usability of the algorithms for visual storytelling.

Below, we enumerate the available datasets (referenced
by their main citation) for each of the relevant tasks applicable
for analysing the main building blocks of lifelogging data:

• Social interaction analysis: (Ml, (6), |70l

• Object recognition/detection/discovery: da, ca, (33, ED, do), Ea, Ea

• Gaze prediction: i3a

• Hand detection/segmentation: E3, E), ED, ED, El, ED

• Gesture recognition: (8], ll20l, Il22l

• Activity recognition: 1 ^, 1 ^, ED, ED, EH, lHa, ii84i, ifToi, ESI, iia

• Novelty or informative region detection: (61, (1

This analysis reveals the lack of well-established and widely 
accepted datasets. 

B. Egocentric Vision Software 

The publication of source code is crucial to guarantee
the reproducibility of research results and to allow
quantitative comparisons on different datasets. To publicise
available egocentric vision-related software, we present in
Table IV a list of the most relevant repositories, including
source code for object recognition, object discovery, activity
recognition, event segmentation, keyframe-based summarization
and informative image detection.

V. Conclusions and Future Directions 

This review summarized the state of the art of visual
lifelogging analysis from a storytelling perspective, focusing
on the progress made so far in this context in the field of
computer vision. In the first part of this survey we reviewed
several techniques for acquiring, organizing, summarizing and
browsing large collections of unstructured data. In the second
part, we organized the available literature around the central
questions necessary to address the storytelling problem: Was
the user interacting with somebody, and how? Where is he/she?
When did the event occur? What is the person wearing the
camera doing? For each research question we highlighted the
weaknesses and strengths of available methods with respect
to their applicability to the LTR domain. Additionally, we
reviewed all the available datasets and source code.

Generally, from this review, we can draw some conclusions
regarding the crucial points that must be addressed in
short-term research into egocentric vision. First, there is a
need to develop more algorithms suited to data acquired through
photo cameras, in particular for social interaction detection
and analysis, as well as for activity and context recognition.
Second, in view of the large number of datasets made publicly
available in the last few years, it would be useful to foster
cooperation within the lifelogging scientific community to
elaborate richer lifelogging datasets. By doing this,
researchers could validate their algorithms and promote
competition. Third, considering that visual storytelling has to
preserve semantics, a promising direction is to continue
leveraging semantic information for both egocentric data
analysis and summarization. Given the wide variety of settings
in which lifelogging cameras are being deployed, visual
recognition could largely benefit from the use of ontologies.
Moreover, this paper showed that the interest of the computer
vision community in this kind of analysis has increased
considerably over the last few years. In parallel, we witnessed
a burst in the study and applicability of convolutional neural
networks, suggesting that expectations for making progress in
the coming years are growing fast. This progress should be
accompanied by the creation of larger and more consolidated
datasets that will satisfy the enormous data demand of CNNs.
In particular, research efforts should focus on the problems of
1) developing more sophisticated transfer learning strategies
able to reduce the need for large annotated datasets and
2) exploiting the temporal coherence of concepts that
characterizes visual lifelogs. However, given the current
limitations of CNNs in terms of computational cost and
resources, the analysis would be limited to post-processing.
Finally, a promising area of research that has not yet been
explored for storytelling via ego-vision is text description
generation from images. This problem, tackled for instance in
[99], [92], consists of rendering a visual-to-text translation
of what is happening in the images. The development of these
new kinds of multi-modal techniques could open up a new
area, full of potential for egocentric storytelling, in which we
could provide a human-like description of what happened in a
precise scene or event. The application of these algorithms to
the medical field, and more precisely to people with dementia,
could help provide patients with a richer context to understand
better what happened to them in a given situation.
TABLE III: Summary of currently available public egocentric datasets.


Name 

Description 

Type of Annotations 

Camera 

Egocentric Dataset of the University of Barcelona (EDUB) CS)

With 4912 images acquired by the wearable camera
Narrative; divided into 8 different days which capture
daily life activities like shopping, eating, riding a bike,
working, etc. It was acquired by 4 different subjects, 2
days each. It contains 11,294 segmented object
instances from 21 different classes (TV, hand, person,
car, sign, etc.).

Object labels and segmentations

Narrative

All I Have Seen (AIHS) Eol

Contains 19 days with a total of 45,612 images of
640 x 480 resolution, containing around 15 recurrent
places/scenes such as home rooms, work office,
work building, supermarkets, playgrounds, campus,
biking trails, etc.

Not available

SenseCam

Intel Egocentric Object Dataset ff5\

Has 10 video sequences (100,000 frames) from 2 subjects
manipulating 42 different types of everyday object
instances.

Object labels and foreground and background segmentations

PointGrey

GeorgiaTech Egocentric Activities (GTEA) ET)

The videos captured by a cap-worn camera show 7 types
of daily activities, such as making a sandwich/coffee/tea,
each performed by 4 different subjects. Each activity
video is labelled with the list of objects involved;
each frame has left hand, right hand, and background
segmentation marks.

Objects list and hands and background segmentations

GoPro

GTEA Gaze+ Dataset EH

With video and audio recordings of 7 meal-preparation
activities such as making pizza/pasta/salad collected
using eye-tracking glasses. Each activity was performed
by 5 different subjects. Each frame has eye-gaze fixation
data, and different activities such as opening fridge are
annotated.

Gaze and actions performed

Tobii

First-Person Social Interactions Dataset [35]

Day-long videos of 8 subjects spending their day at
Disney World. The cameras are mounted on a cap
worn by the subjects. Elan annotations contain the
number of active participants in the scene, and the type
of activity: walking, waiting, gathering, sitting, buying
something, eating, etc.

Actions performed and social interactions at each time period

GoPro

Huji EgoSeg Dataset [73]

With 29 videos captured by an egocentric camera, annotated
in Elan format. The videos (some from YouTube
and others recorded by Hebrew University of Jerusalem
researchers) contain various daily activities.

Actions performed at each time period

GoPro

UT Ego Dataset [64]

Has 4 videos captured by a Looxcie wearable camera
(head-mounted). Each video is about 3-5 hours long,
captured in a natural, uncontrolled setting. The videos
capture a variety of daily activities.

Important regions annotation

Looxcie

Interactive Museum Dataset lIHl

A gesture recognition dataset taken from an egocentric
perspective in a virtual museum environment. It has 5
different users who performed 7 hand gestures.

Hand gestures

No information
VINST - Visual Diaries fll

With 31 videos capturing the visual experience of a
subject walking from a metro station to work. It consists
of 7236 images in total. Each image is annotated with
a location ID which covers 9 unique labels in total.
Temporal segments corresponding to novel ego motions
are annotated as well.

Location and “novel ego-motions” annotations per frame

No information

UCI Activities of Daily Living Dataset (ADL) im

Has 1 million frames of dozens of people performing
18 daily indoor activities such as brushing their teeth,
washing dishes, or watching television, each performed
by 20 different subjects. It includes annotations of 42
object classes.

Activities, object bounding boxes and classes, hand positions and interaction events

GoPro

EGO-HPE 0

A set of egocentric videos with different subjects for
head pose estimation. Each video is annotated at the
frame level for five yaw angle orientations (-75, -45, 0,
45, 75) with respect to the subject wearing the camera.

Face orientation

Vuzix Smart Glass

EGO-GROUP 0

A social group detector dataset for egocentric vision,
which consists of 10 videos collected in different
situations: a laboratory, a coffee break, a conference room
and an outdoor scenario.

People group composition

Vuzix Smart Glass

JPL First-Person Interaction Dataset CD

Human activity videos taken from a first-person
viewpoint. The dataset specifically aims to provide
first-person videos of interaction-level activities, recording
how things look from the perspective of a person/robot
participating in physical interactions.

Actions performed in each time period

GoPro

NUS First-person Interaction Dataset fTOl

Dataset for interaction recognition with 8 interactions in
2 perspectives (first-person and third-person) resulting
in 16 classes in total. The dataset will be made publicly
available at a later date. It contains 2 human-human
interactions, 2 human-object-human interactions and 4
human-object interaction classes. It contains 260 videos
with at least 15 samples in each class.

Interaction type

GoPro


CMU Multi-Modal Activity Database (CMU-MMAC) da

Multimodal dataset of 18 subjects cooking 5 different
recipes (brownies, pizza, etc.); also contains audio, body
motion capture, and IMU data.

Frame-level action

No information

CMU EDSH (hands under varying illuminations) 153

Dataset of over 600 hand images taken under various
illumination conditions and different backgrounds. Each
image is segmented at the pixel level.

Hand segmentation

GoPro

EgoHands Dataset Q

Contains 48 Google Glass videos of complex,
first-person interactions between two people. The main
intention of this dataset is to enable better, data-driven
approaches to understand hands in first-person computer
vision.

Hand segmentation

Google Glass

Unige-Hands Dataset CD

Videos recorded in 5 different locations (office, street,
bench, kitchen and coffee bar) intended for hand
detection.

Hand/No Hand label per frame

GoPro

Yale Human Grasp Dataset 1^

Dataset with 27.7 hours of tagged video recorded by two
housekeepers and two machinists during their regular
work activities. It includes the tagged grasp type with
its time information, objects manipulated and parameters
of the performed task.

Grasp tagging, and interval and object labels

RageCams
















UT Grasp Data Set

Dataset recorded in a controlled environment with four
different subjects, who were asked to grasp a set of
objects placed on a desktop with specific types of grasps.
The most common subset of 17 grasp types from Feix's
taxonomy was selected to perform these everyday
activities.

Hand grasp type and
start/end frame number

GoPro 

Life-logging EgoceNtric
Activities (LENA) [84]

Egocentric video database containing 13 categories of 
activities relevant to lifelogging applications performed 
by 10 different subjects. Each subject recorded 2 clips
for each activity (20 clips per activity). Each clip has a
duration of 30 seconds. 

Activities performed. 

Google Glass

COGNITO

Non-periodic manipulative tasks in an industrial context.
All the video sequences were captured with on-body
sensors consisting of IMUs, a backpack-mounted RGB-D
camera for the top view and a chest-mounted fish-eye
camera for the front view of the workbench.

Activity labels and 
objects and wrist 
tracklets 

RGB-D and others

Michigan-Milan Indoor 
Dataset 

Contains 10 video sequences collected with common smartphones
in a variety of environments, including offices,
corridors and large rooms, where the observer moves
freely (6 DoF) around the scene.

Image segmentations
with the labels
"ceiling", "floor" or
"wall"

Smartphone 

Bristol Egocentric Object
Interactions Dataset

Dataset captured with wearable gaze tracker software 
containing various pre-defined actions of daily living 
in different indoor locations (kitchen, workspace, gym, 
laser printer, corridor and weight-lifting machine). The 
videos in each sequence are recorded by 3-5 different 
users. 

3D maps and 3D objects GT

ASL Mobile Eye XG

DogCentric Activity
Dataset [148]

The DogCentric Activity Dataset is composed of dog activity
videos taken from a first-person animal viewpoint. The
dataset contains 10 different types of activities, including
activities performed by the dog itself, interactions
between people and the dog, and activities performed
by people or cars. The videos have a resolution of
320x240 at 48 frames per second.

Activity performed 

GoPro 

UEC EgoAction Dataset

A set of videos (acquired by the researchers or publicly
available on YouTube) recording different sports (skiing,
mountain biking, etc.). Each video is several minutes long and
contains a wide set of actions performed by the user.

Activities performed 

GoPro 
TABLE IV: List of the most relevant public software related to
egocentric vision.


Alireza Fathi's Egocentric Vision
Toolbox [36], [37]

Toolbox including functions for applying different data processing to egocentric 
videos, including motion estimation, image segmentation, object classification and 
action classification among others. 

OpenCV and CUDA

http://ai.stanford.edu/~alireza/GTEA_Gaze_Website/Code/index.html 

Ego-Object Discovery [16]

Object Discovery Algorithm on Egocentric Images. Semi-supervised algorithm that 
uses initial object proposal generation, a CNN-based feature representation, false 
positive filtering, and an interactive object discovery with Refill strategy. 

Matlab and Caffe 

https://github.com/MarcBS/Ego-Object_Discovery 

Detecting Activities of Daily Living
in First-person Camera Views

Train and test code for the problem of detecting activities of daily living (ADL). It 
applies novel representations including temporal pyramids to approximate temporal 
correspondences, and composite object models that exploit the differences between 
the objects when being interacted with. 

Matlab 

http://people.csail.mit.edu/hpirsiav/codes/ADLdataset/adl.html 

Temporal Pooling of CNN Vectors [80]

It includes the pooled time series (PoT) representation framework as well as basic
per-frame descriptor extractions including a histogram of optical flows (HOF) and
histogram of oriented gradients (HOG).

Java and OpenCV [exec. only]

https://github.com/mryoo/pooled_time_series/ 

Temporal Segmentation of Egocentric
Videos [73]

Software for segmentation and event classification of egocentric HTR videos. It 
applies a hierarchical classification using cumulative displacement curves. 

Matlab and 

http://www.vision.huji.ac.il/egoseg/ 

Doherty Wearable Camera Browser

Application for data segmentation, annotation and browsing. It supports analysis of
images from the following photographic cameras: Vicon Autographer, Revue, or
SenseCam.

[exec. only]

http://sensecambrowser.codeplex.com/ 

SR-Clustering for Event Segmentation

Segmentation of events in egocentric lifelogging photo streams. It uses convolutional 
neural network features and an energy minimisation (Graph-Cut) technique to 
segment photo sequences. 

Matlab and Caffe 

https://github.com/MarcBS/SR-Clustering

Motion-Based Egocentric Segmentation

It applies a robust SIFT-Flow motion estimation suitable for photo sequences to 
perform photo stream segmentation in motion-related events. 

Matlab 

https://github.com/MarcBS/Motion_Video_Segmentation

Egocentric Vision Keyframe Summarization

The code extracts a visual summary of a set of egocentric images captured by a 
photo camera. The result is a collage with one image summarizing every event in 
the image set. It uses a frame representation by means of a convolutional neural 
network followed by an event segmentation based on agglomerative clustering and 
keyframe selection based on Random Walk. 

Matlab and Caffe 

https://github.com/MarcBS/Egocentric-Visual-Keyframes-Summary

Egocentric Snap Points Detection [96]

Automatic prediction of snap points in unedited egocentric video, that is, those frames
that look as if they could be photos taken intentionally. It makes use of a generative
model for snap points that relies on a photo prior derived from intentional (conventional)
images together with domain-adapted features.

Matlab and C 

https://github.com/bxiong1202/snap-points


















